RMTLysPTM: recognizing multiple types of lysine PTM sites by deep analysis on sequences

Abstract Post-translational modification (PTM) occurs after a protein is translated from ribonucleic acid. It is an important living creature life phenomenon because it is implicated in almost all cellular processes. Identification of PTM sites from a given protein sequence is a hot topic in bioinformatics. Lots of computational methods have been proposed, and they provide good performance. However, most previous methods can only tackle one PTM type. Few methods consider multiple PTM types. In this study, a multi-label classification model, named RMTLysPTM, was developed to recognize four types of lysine (K) PTM sites, including acetylation, crotonylation, methylation and succinylation. The surrounding sites of a lysine site were selected to constitute a peptide segment, representing the lysine at the center. Deep analysis was conducted to count the distribution of 2-residues with fixed location across the four types of lysine PTM sites. By aggregating the distribution information of 2-residues in one peptide segment, the peptide segment was encoded by informative features. Furthermore, a prediction engine that can precisely capture the traits of the above representations was designed to recognize the types of lysine PTM sites. The cross-validation results on two datasets (Qiu and CPLM training datasets) suggested that the model had extremely high performance and RMTLysPTM had strong generalization ability by testing it on protein Q16778 and CPLM testing datasets. The model was found to be generally superior to all previous models and those using popular methods and features. A web server was set up for RMTLysPTM, and it can be accessed at http://119.3.127.138/.


INTRODUCTION
Post-translational modification (PTM) is a one of the largest stages in protein biosynthesis [1].It occurs after a protein is translated from ribonucleic acid.PTM can change the physical and chemical properties of proteins through specific modifications, and it is implicated in almost all cellular processes.To date, hundreds of PTM types have been discovered and several of them have been identified to be related to diseases, such as cancer and neurological disorders [2].Due to its importance in basic research and drug development, PTM is always a hot topic in protein science.
Among the discovered PTM types, the modification at lysine (K), also named K-PTM, is one of the most frequently observed and special PTM types.Lysine can be annotated by multiple types of PTM, including acetylation, biotinylation, butyrylation, crotonylation, methylation, propionylation, succinylation, ubiquitination and ubiquitin-like modifications [3].In the past, biological experimental methods, such as mass spectroscopy and phosphorspecific antibody, were used to determine the PTM types.A solid determination can be obtained through these methods.However, their defect is very evident.Much time and high cost are needed to conduct these methods.With the coming of post-genome era, a great deal of protein sequences have emerged.These methods cannot process many sequences in time.Thus, designing fast and reliable methods to deal with such problem is needed.
In the past 20 years, in silico methods have become an alternative method to recognize PTM types.Several computational methods have been proposed to identify different lysine modifications, including acetylation [4][5][6][7][8], crotonylation [9][10][11][12][13], methylation [14][15][16][17] and succinylation [18][19][20][21][22].These methods investigate one type of lysine PTM sites individually.For example, studies on acetylation did not consider crotonylation nor methylation.Combining the identification of different types of lysine PTM sites into a unified method is feasible because these types of modification involve lysine.In 2016, Qiu et al. [3] proposed the first computational method, iPTM-mLys, to identify four types of lysine PTM sites, including acetylation, crotonylation, methylation and succinylation.This method contains four procedures.Each procedure is in charge of identifying one type of lysine PTM site.A simple undersampling scheme is used to tackle the imbalanced problem.The results of the four procedures are combined as the final output of iPTM-mLys.Later, the following three methods were designed in a similar manner: predML-Site [23], mLysPTMpred [24] and iMul-kSite [25]; they improved iPTM-mLys by employing more powerful sampling schemes and more suitable single-label classification algorithms.The above methods have a common trait.They tackle four types of lysine PTM individually and divide the problem into four binary classification problems.Thus, they ignore the mutual inf luence of different types of lysine PTM sites.The other two methods, CNN + SGT [26] and MLysPRED [27], are Figure 1.Flow chart of the RMTLysPTM.Four types of lysine PTM sites are considered, including acetylation, crotonylation, methylation and succinylation.The peptide segments containing lysine at center are used to represent the lysine.From the given dataset, the distribution information of 2-residue with fixed location is counted across four types and this information is adopted to encode peptide segments.A novel prediction engine is designed to recognize the types of lysine PTM sites.designed in different manners.They extract more features from the sequences and directly apply multi-label classification algorithms [CNN in CNN + SGT and multi-label K-nearest neighbor classification algorithm (MLKNN) in MLysPRED] to make prediction.These two methods employ the sampling schemes to deal with imbalanced problems.The performance of all the existing methods was tested on the dataset reported in Qiu et al.'s study.However, it was not very high.The absolute true did not exceed 0.9, thus they could be improved.
In this study, data on lysine PTM sites collected in Qiu et al.'s study [3] were adopted, and they are denoted as Qiu dataset.By applying the sliding window technique, a peptide segment with 27 sites was obtained for each lysine.The distribution information of 2-residues with fixed location across four types of lysine PTM sites was counted by deep analysis on the sequences of peptide segments, and this information was further used to encode each peptide segment.On the basis of such representation, a prediction engine that can capture the traits of the representation was specially designed to recognize the types of lysine PTM sites.The developed model was called RMTLysPTM.Cross-validation tests indicated that the model had extremely high performance and strong generalization ability because it provided competitive prediction results on protein Q16778.The comparison results suggested that RMTLysPTM generally outperformed all the existing previous models and other models built using popular methods and features.Furthermore, RMTLysPTM was tested on two recently proposed datasets, CPLM training and testing datasets, and the outcomes indicated the high performance of RMTLysPTM.

MATERIALS AND METHODS
The entire procedures for developing RMTLysPTM are illustrated in Figure 1.The detailed descriptions on each procedure are listed in this section.

Benchmark dataset
The lysine PTM sites investigated in this study were retrieved from the study of Qiu et al. [3], who collected 6394 sites from UniProt (http://www.uniprot.org/).These sites comprise the dataset, denoted as S, and they involved 1763 human proteins.In accordance with the original study, these lysine PTM sites were divided into four PTM types: acetylation, crotonylation, methylation and succinylation.However, several sites were not annotated by these PTM types.These sites constitute the fifth class, termed as 'other'.An upset graph was plotted to show the intersection of sites annotated by any of the four types or no type, as shown in Figure 2. The results showed that acetylation sites were annotated the most, followed by succinylation sites, whereas crotonylation and methylation sites were much less annotated than the above two types.Several sites belong to two or three types of PTM sites, whereas no sites belong to all four types.Clearly, recognizing the types of lysine PTM sites is a multi-label classification problem.In addition, 1750 lysine sites do not belong to any of the above PTM types.In the following formulation, the sets consisting of acetylation, crotonylation, methylation and succinylation sites were denoted as S acetylation , S crotonylation , S methylation and S succinylation , respectively, whereas the other sites constitute the set S other .Accordingly, the dataset S can be formulated by The lysine site only refers to a single amino-acid residue lysine in a protein sequence.Evidently, this information is insufficient to construct classification models.The sliding window technique is widely used in the field of PTM, and it was also employed in this study.From a protein sequence, the peptide segments that contain lysine at the center are extracted as follows: where L refers to the sliding window size, which was set to 13 in this study, as suggested in Qiu et al.'s study [3].Such peptide segment clearly contains three parts: (i) lysine, (ii) upstream L residues of lysine in the sequence and (iii) downstream L residues of lysine in the sequence.In particular, if the upstream or downstream residues are less than L, the nearest residue is adopted to fill the locations.Finally, the dataset S consisting of 6394 peptide segments with length 27 was obtained, each of which contained lysine at the center.These peptide segments can be found in the supplementary files of Qiu et al.'s study [3].
A recently constructed dataset used in [27], which was retrieved from CPLM 4.0 [28], a data resource for various PTMs specifically occurring at the side-chain amino group of lysine residues in proteins, was further employed to fully test the proposed model.A total of 18 978 human protein sequences containing at least one lysine modified by acetylation, crotonylation, methylation or succinylation were downloaded from CPLM 4.0.After the data cleaning procedure described in [27] was conducted, 7057 human protein sequences were kept.The lysine sites modified by acetylation, crotonylation, methylation or succinylation were divided into two parts, 70% of them constituted the training dataset, whereas the remaining 30% data comprised the testing dataset.An upset graph for the lysine sites in the training and test datasets was separately plotted, as shown in Figure 3.Such dataset was much larger than the previous dataset, indicating that the test results on such dataset were more reliable.Furthermore, several lysine sites can be modified by all four types, and the lysine sites that cannot be modified by any PTM type were not included, which are the major differences from the previous dataset.Testing these two different datasets can fully evaluate the model's performance.For easy description, the training and testing datasets reported in [27] are called CPLM training and testing datasets, respectively, whereas the dataset obtained from Qiu et al.'s study [3] is called Qiu dataset.The CPLM training and testing datasets can be obtained at http://47.100.136.41:8181/dataSet.

Feature construction
Feature representation is an important step to develop efficient classification models.In accordance with the Benchmark dataset section, the upstream and downstream residues of the investigated lysine sites in Qiu dataset were selected to constitute a peptide segment with length 27.How to extract informative features from such peptide segment was a challenging problem.Inspired by k-mer in DNA research, the peptide segment was separated into several k-residues consisting of continuous k-residues in the peptide segment.For example, given a peptide segment 'CAVSSIRTLRQLGKKTVVVNCNPETVS', its k-residues were 'CA, AV, VS, . . ., VS' when k = 2.In this study, a novel scheme was proposed to extract informative features based on the k-residues.A huge number of combinations would exist when k is large because 20 different amino acids are present.Accordingly, k was set to 2 in this study.
S tr = {P 1 , P 2 , • • • , P a } is assumed to be a training dataset, where a is the number of training peptide segments and P i is the i-th peptide segment containing lysine at the center.These training peptide segments were classified into five subsets in accordance with the type of lysine PTM site.The first four subsets contain the peptide segments of acetylation, crotonylation, methylation and succinylation, denoted as S tr acetylation , S tr crotonylation , S tr methylation and S tr succinylation , respectively, and the last subset contained the remaining peptide segments, indicated by S tr other .For any 2-residue, such as 'CA', if it frequently occurs in the peptide segments of one type of PTM site, the test peptide segment containing such 2-residue is more likely annotated by such PTM type.However, the location of the 2residue is an important information, which should be also included.In detail, the same 2-residue close to the center lysine and far away from the center lysine should be considered as different 2-residues.Thus, the 2-residue was denoted as αβ i, i + 1 , where α and β stand for two amino acids in 2residue, and i the locations of α and β in the peptide segment, as the subscripts in Eq. 2. For αβ i, i + 1 , the number of training peptide segments, which contain it at the same location, is counted.Such entry is represented by N αβ i, i + 1 , i.e.
where αβ i, i + 1 , P k is defined as Furthermore, the number of peptide segments in four subsets, i.e. S tr acetylation , S tr crotonylation , S tr methylation and S tr succinylation , is counted as follows: The above entries ref lect the distribution of αβ i, i + 1 across four types of PTM sites.If one of the above entry is large, the test peptide segment containing αβ i, i + 1 at the same location is more likely to be annotated by the corresponding PTM type.Meanwhile, if the above four entries are small, the test peptide segment may not be annotated by any PTM type.Thus, these entries are useful information to recognize the types of lysine PTM sites.However, direct use of them is not a perfect method because the ranges are relatively different for various 2-residues.Thus, they were further refined as follows: Entries in Eq. 6 indicate the proportions of peptide segments that contain αβ i, i + 1 at the same location, which are all between 0 and 1.These values are suitable to represent αβ i, i + 1 .Accordingly, αβ i, i + 1 can be represented by a four-dimension vector, formulated as For a 2-residue KA(−L, −L + 1), the number of peptide segments with R −L = K and R −L+1 = A (Eq. 2 shows the formulation of peptide segment) in the training dataset is counted.When this value is 500, that is, N (KA (−L, −L + 1)) = 500, then the numbers of such peptide segments containing lysine sites modified by acetylation, crotonylation, methylation and succinylation are further counted.When they are 50, 10, 400 and 0, i.e.N acetylation (KA (−L, −L + 1)) = 50, N crotonylation (KA (−L, −L + 1)) = 10, N methylation (KA (−L, −L + 1)) = 400 and N succinylation (KA (−L, −L + 1)) = 0, respectively, the above values induce four items: ρ acetylation (KA (−L, −L + 1)) = 50 500 = 0.1, ρ crotonylation (KA (−L, −L + 1)) = 10 500 = 0.02, ρ methylation (KA(−L, −L + 1)) = 400 500 = 0.8 and ρ succinylation (KA (−L, −L + 1)) = 0 500 = 0.These items constitute a four-dimension vector [0.1, 0.02, 0.8, 0] T to represent KA(−L, −L + 1).For a training or test peptide segment with R −L = K and R −L+1 = A, this vector is used to represent these two residues in the peptide segment.By deep analysis of the training peptide segments, all 2-residues at each location can be encoded into a four-dimension vector.These features contain not only the location information of the 2-residue in training peptide segments but also the label information of training peptide segments.They are useful to determine the types of lysine PTM sites for a given test lysine alone with its peptide segment.
For a test or training peptide segment of length L, all its 2residues were selected except those containing the center lysine, 2 × (L − 1) 2-residues in total.As mentioned above, each 2-residue can be represented by a four-dimension vector.The vectors of all collected 2-residues are aggregated together to generate the representation of the peptide segment.For the peptide segment listed in Eq. 2, its feature vector is formulated as follows: where ⊕ is the concatenation operation.Accordingly, each peptide segment was represented by 8 × (L − 1) features.Given that the peptide segments in Qiu dataset were obtained by setting L = 13, they were represented by 96 features.For easy description, these features were called distribution features.

Prediction engine
Selecting or designing a proper classification algorithm is important when building efficient classification models.A novel algorithm that can capture the traits of such representation was designed on the basis of the feature representation of the peptide segment containing lysine at the center.As mentioned in Benchmark dataset section, this algorithm can process samples with multiple labels.
Multi-label Gaussian kernel regression (ML-GKR) is a widely used algorithm to set up multi-label classification models [29][30][31][32][33][34].In the present study, its variation was designed so that the new designed algorithm can more efficiently process aboveconstructed features.In accordance with the feature vector listed in Eq. 8, it aggregated four features of 2-residues in the sequence in order.The first feature was for acetylation, followed by those for crotonylation, methylation and succinylation, respectively.Thus, each component in the final feature vector is highly related to a certain type of PTM sites (acetylation, crotonylation, methylation or succinylation, which were termed as labels when constructing classification models).In ML-GKR, the score for each label is calculated using all features because these features are not special for any labels.In view of the trait of the feature vector, the score for one label can be calculated on the basis of the features related to this label only, thereby improving the accuracy of the score.The new method is referred to as multi-label Gaussian kernel precision regression (ML-GKPR).
Each peptide segment in a given training dataset containing a peptide segments, denoted by S tr = {P 1 , P 2 , • • • , P a }, can be represented by an 8 × (L − 1)-dimension feature vector according to Eq. 8.The feature vector of the i-th peptide segment is formulated as follows: and its observed labels can be formulated by a four-dimension binary vector, denoted by where l j i 1 ≤ j ≤ 4 is defined as follows: For a query peptide segment q containing lysine at the center, its feature vector is formulated by Its label vector can be determined as follows: The score for each label is calculated as follows: where θ is a parameter, and V q − V (P i ) 2 m is defined as According to S v , l v q is determined in the following manner: The classification model developed using the above features and prediction engine is called RMTLysPTM in this study.

Performance evaluation
Several validation methods can be used to evaluate the performance of classification models.Cross-validation is one of the most widely used methods.In this method, samples are equally and randomly divided into K parts.Each part is selected as a test set one by one and the remaining parts comprise the training set.The model built on the training set is applied to the test set.The average performance on each part is selected as the final performance of the classification model.In general, K is set to 5 or 10.Here, 5 was selected, i.e. 5-fold cross-validation was adopted to assess the performance of all classification models.According to the feature construction scheme mentioned in Feature construction section, the procedures are highly related to the training samples, that is, the representation of the same sample is not the same in different rounds of cross-validation.Therefore, samples were divided into five parts, and then each sample was encoded on the basis of this division.In this manner, the information of test samples could be rigorously excluded when training the classification model.
For a multi-label classification model, its cross-validation results can be counted as several measurements.Here, the same measurements used in previous studies [3,[23][24][25][26][27] were selected for easy comparison.These measurements included aiming, coverage, accuracy, absolute true and absolute false, which all have wide applications in evaluating the performance of multilabel classification models [35][36][37].Some notations are necessary to show how to compute them.For a dataset containing N samples and M labels, the observed labels of the i-th sample constitute the set L i , whereas its predicted labels comprise the set L * i .Then, the above measurements can be formulated as follows: where L i , L * i is an indicator, which is set to 1 if L i and L * i are identical; otherwise, it is set to 0. Clearly, the higher the aiming, coverage, accuracy and absolute true, the higher the performance.On the contrary, i.e. a low absolute false value suggests high performance.
Besides the above overall measurements, some measurements that are defined on each label were employed.For one label, the samples having this label are termed as positive samples, whereas the others are considered as negative samples.Then, the sensitivity (SN), specificity (SP), accuracy (ACC), precision and F1score for the i-th label can be calculated as follows: where TP(i), TN(i), FP(i) and FN(i) stand for the true positive, true negative, false positive and false negative for the i-th label, respectively.Furthermore, ROC and PR curves are plotted to fully display the models' performance on the i-th label, and the areas under these two curves are calculated, denoted by AUROC(i) and AUPR(i), respectively.

Parameter optimization
For the prediction engine mentioned in Prediction engine section, θ is an important parameter, similar to that in ML-GBK.Several values (1/2, 1/4, 1/6, 1/8, 1/16 and 1/32) were attempted for this parameter, and the corresponding model was built, which was   4. At first, with the decrease in θ , the performance improved, and when θ = 1/6, the model yielded the best performance for all overall measurements.Then, the performance reduced with the decrease in θ.Clearly, 1/6 was the best setting for θ .Thus, the model with θ = 1/6 was the proposed model of this study and called RMTLysPTM.

Performance of RMTLysPTM
As mentioned in Parameter optimization section, θ was set to 1/6 in RMTLysPTM.Its overall performance (measurements are listed in Eq. 20) under 5-fold cross-validation is listed in Table 1.Evidently, the performance was extremely high.In detail, aiming, coverage, accuracy and absolute true reached 0.9943, 0.9949, 0.9934 and 0.9909, respectively.All values were higher than 0.99.Meanwhile, the absolute false value was as low as 0.0023.The performance of the model on four types of PTM sites was evaluated.
The measurements (cf.Eq. 21) are listed in Table 2. Similar to the overall performance, all measurements on the four types were very high.Most were higher than 0.99, and only two values were between 0.98 and 0.99.Evidently, the performance of RMTLysPTM on each type was very high.In addition, the ROC and PR curves on each type were plotted, as shown in Figure 5.All the AUROC and AUPR values were higher than 0.99, further proving the high performance of RMTLysPTM.According to Figure 2, 1750 lysine PTM sites were not labeled by any type.Investigating whether these samples inf luence the performance of RMTLysPTM is necessary.Thus, these lysine sites were excluded, and RMTLysPTM was constructed again, which was called RMTLysPTM excluding lysine sites without labels to distinguish RMTLysPTM.Its performance under 5-fold crossvalidation is listed in Table 1.The aiming, coverage, accuracy and absolute true values were 0.9929, 0.9989, 0.9925 and 0.9856, respectively, and the absolute false value was 0.0036.Compared with the performance of RMTLysPTM, which is listed in Table 1, such performance slightly reduced.Each measurement decreased or increased by less than 1%.These results suggested that the lysine sites without any PTM type provided limited inf luence for RMTLysPTM.
RMTLysPTM showed excellent performance for recognizing the types of lysine PTM sites regardless of whether the special lysine sites (without any PTM type) were included or not.

Influence of sliding window size
In this study, the sliding window technique was adopted to generate the peptide segment for each lysine site.This technique is commonly used in PTM prediction.The sliding window size was set to 13 in RMTLysPTM (L = 13 in Eq. 2).Here, the sizes were set to 7, 9 and 11, and the performance of RMTLysPTM was evaluated with these sliding window sizes.The 5-fold crossvalidation results are listed in Table 3.For easy comparison, the overall performance of RMTLysPTM with a size of 13 is also listed in this table.The aiming, coverage, accuracy and absolute true values followed an increasing trend with the increase in sliding window sizes, whereas the absolute false value followed a contrary trend.The findings implied that the performance of RMTLysPTM enhanced when the sliding window size increased.When the size reached 13, the performance of RMTLysPTM was

Strict test on RMTLysPTM
The general 5-fold cross-validation randomly and equally divided peptide segments into five parts.The peptide segments derived from the same protein sequence may be in the training and test datasets in a certain round of 5-fold cross-validation.In this case, the evaluation results may be overestimated.Thus, a strict 5-fold cross-validation was adopted to fully test RMTLysPTM.All protein sequences were randomly and equally divided into five parts.In each round, the peptide segments derived from the sequences in one part were selected as test samples, whereas the others constituted the training samples.The peptide segments extracted from the same protein sequence can only be contained in the test dataset or training dataset, that is, the protein sequences involved in the test samples were not considered in the training of the model.This division was stricter than that in the general 5-fold cross-validation.RMTLysPTM was evaluated by such cross-validation method, and the results are listed in the last column of Table 1.The performance evidently reduced.The aiming, coverage, accuracy and absolute true values decreased to 0.9306, 0.9604, 0.9291 and 0.8955, respectively, whereas the absolute false value increased to 0.0275.Although such performance was inferior to that yielded by the general 5-fold cross-validation, it was still relatively high.The aiming, coverage, accuracy and absolute true values were still higher or close to/than 0.9.The performance of RMTLysPTM on the four PTM types under such test is provided in Table 4 and Figure 6.Most measurements in Table 4 were very high (≥0.9);three AUROC values were higher than 0.97, and two AUPR values were higher than 0.97, suggesting the high performance of RMTLysPTM.These results indicated that RMTLysPTM remained powerful in recognizing the types of lysine PTM sites from a new protein sequence.

Extensive comparisons of other models
Some models were constructed using popular methods and features and compared with RMTLysPTM to indicate its superiority.
In RMTLysPTM, ML-GKR was modified and a new prediction engine, called ML-GKPR, which can precisely capture the traits of the representations of peptide segments, was constructed.Here, ML-GKPR was replaced with ML-GKR to set up another model.The parameter θ was tuned, and θ = 1/4 gave the best performance.Such model was evaluated by 5-fold cross-validation.The overall performance is listed in Table 5.The aiming, coverage, accuracy and absolute true values were 0.9806, 0.9790, 0.9775 and 0.9728, respectively, which were all lower than those of RMTLysPTM.The absolute false value was 0.0069, higher than that of RMTLysPTM.This model evidently provided lower performance than RMTLysPTM.Furthermore, the detailed performance of the model on the four PTM types is listed in Supplementary Table S1, whereas the ROC and PR curves are provided in Supplementary Figure S1A and B, respectively.Compared with the corresponding results of RMTLysPTM (Table 2 and Figure 5), RMTLysPTM provided better performance.All these results proved that ML-GKPR was more suitable to deal with the representation of peptide segments proposed in this study than ML-GKR.ML-GKR and ML-GKPR are algorithm adaption methods for building multi-label classification models.Another popular method is problem transformation, which transforms the original problem into multiple single-label classification problems.
RAndom k-labELsets (RAKEL) [38] is one of the most classic problem transformation methods, and it was adopted in the present study to construct models.Random forest (RF) [39] and decision tree (DT) [40] were adopted as the base classification algorithm.RAKEL, RF and DT were implemented by Meka (http://waikato.github.io/meka/)[41].The models using different base classification algorithms (RF or DT) were constructed and evaluated by 5-fold cross-validation.The evaluation results are listed in Table 5.When RF was adopted as the base classification algorithm, the model provided aiming, coverages, accuracy, absolute true and absolute false values of 0.9948, 0.9946, 0.9913, 0.9784 and 0.0056, respectively.The aiming value was slightly higher than that of RMTLysPTM, the coverage and accuracy values were slightly lower, the absolute true value was about 1.3% lower and the absolute false value was about 0.3% higher.In general, RMTLysPTM was more powerful than the RAKEL model.The performance of such RAKEL model on four PTM types (Supplementary Table S1 and Supplementary Figure S1C and D) indicated that it was generally inferior to RMTLysPTM.When DT was used as the base classification algorithm, the performance of such model was quite low.The aiming, coverage, accuracy and absolute true values were lower than 0.7, and the absolute false value was higher than 0.1.Evidently, the RAKEL model was much inferior to RMTLysPTM.The same conclusion can be found by observing the performance of the RAKEL model on the four PTM types (Supplementary Table S1 and Supplementary Figure S1E and F).
Besides ML-GKPR, the reason for the superiority of RMTLysPTM was the representation of peptide segments.Here, the widely used position-specific scoring matrix (PSSM) was employed to represent each peptide segment.The PSI-BLAST [42] with Swissprot [43] database was used to generate the PSSM profiles for each peptide segment.These profiles further refined into a 540 (27 × 20)-dimension vector.The RAKEL algorithm was adopted to construct the model, where the base classification algorithm was RF or DT.The constructed model was also assessed by 5fold cross-validation.The prediction results are listed in Table 5.
The performance of such model was relatively low despite which base classification algorithm was selected.The aiming, coverage, accuracy and absolute true values were lower than 0.62, and the absolute false value was higher than 0.14.Such performance was greatly lower than that of RMTLysPTM.Moreover, the performance of the model on the four PTM types (Supplementary Table S1 and Supplementary Figure S2) was evidently lower than that of RMTLysPTM.Therefore, RMTLysPTM was much better than the model using PSSM features and RAKEL.The findings partly proved that the distribution features were more powerful than the PSSM features in recognizing the PTM types.Given the same prediction engine (RAKEL), the results in Table 5 showed that the models with distribution features were better than those with PSSM features, further confirming the results above.

Performance of RMTLysPTM on protein Q16778
Protein Q16778 was a test protein in Qiu et al.'s study, and it was used as an independent testing dataset in some previous studies [23,24,26].This protein was also employed to test the generalization ability of RMTLysPTM.The predicted and experimental results of the lysine sites in Q16778 are listed in Table 6.Sites 6-86 were correctly predicted.The aiming, coverage, accuracy, absolute true and absolute false values were 0.8250, 0.8250, 0.8167, 0.8000 and 0.0625, respectively.Although this performance was evidently lower than the training results (Table 1), it was still relatively high, suggesting that RMTLysPTM had a strong generalization ability.

Comparison with previous models
This study adopted the lysine PTM sites in Qiu dataset.This dataset has become a benchmark dataset, which has been used to test the performance of all previous models, because it is the first dataset containing multiple types on lysine PTM sites.The previous models were tested by 5-fold cross-validation, and their overall performance is listed in Table 7.Among them, iMul-kSite provided the best 5-fold cross-validation results on Qiu dataset.Its accuracy and absolute true values reached 0.9270 and 0.8877, respectively.CNN + SGT ranked the second, mLysPTMpred and predML-Site provided nearly equal performance, whereas MLysPRED and iPTM-mLys had evidently lower performance than the above models.The performance of these models was compared with that of RMTLysPTM.For easy comparison, the overall performance of RMTLysPTM under two types of 5-fold crossvalidation is listed in Table 7.Under the general 5-fold crossvalidation, RMTLysPTM provided much higher performance than the existing models.For example, the absolute true value of RMTLysPTM was at least 10% higher than those of the other models.Furthermore, the aiming, coverage and accuracy values were evidently higher.Meanwhile, the absolute false value was remarkably lower than others at lower than 0.01, whereas those of the other models were higher than 0.02.The overall performance of RMTLysPTM under a strict 5-fold cross-validation was almost equal to that of iMul-kSite and higher than the performance of other previous models.These results indicated that RMTLysPTM was superior to all existing models.As mentioned in Introduction, iPTM-mLys, predML-Site, mLysPTMpred and iMul-kSite considered each type of lysine PTM sites individually and thus ignored the mutual inf luence of different types.In this case, they cannot use the information of one type to predict another one, which is the main reason why their performance was lower than that of RMTLysPTM.Although CNN + SGT and MLysPRED were constructed by directly using multi-label classification algorithms, the features of peptide segments were not very informative.The features used in these methods were mainly extracted from a single sequence and did not include the label information.RMTLysPTM adopted the features by deeply analyzing the distribution of 2-residue in the training dataset on four labels, which were more informative than those in CNN + SGT and MLysPRED.Therefore, RMTLysPTM had higher performance than CNN + SGT and MLysPRED.
The performance of several previous models on protein Q16778 was tested, as in Performance of RMTLysPTM on protein Q16778 section.The overall performance is listed in Table 8.The test results of RMTLysPTM were included in this table for easy comparison.predML-Site provided the highest performance with the best values on all five measurements.The performance of RMTLysPTM was slightly lower than those of predML-Site and mLysPTMpred.However, it was still competitive.RMTLysPTM only correctly predicted one lysine site less than predML-Site because 20 lysine sites were only present in protein Q16778.Therefore, few lysine sites increased the prediction contingency.

Test results on CPLM training and testing datasets
In [27], new lysine PTM data derived from CPLM 4.0 were constructed.They contained two datasets, which were called CPLM training and testing datasets.Here, RMTLysPTM was evaluated on these two datasets.The sliding window size was set to 24 to give a fair comparison, as used in [27].The 5-fold cross-validation results on the CPLM training dataset are listed in Table 9.The aiming, coverage, accuracy, absolute true and absolute false values were 0.9983, 0.9986, 0.9977, 0.9958 and 0.0011, respectively.Such performance was relatively similar to that on Qiu dataset.Furthermore, the RMTLysPTM built on the CPLM training dataset was applied to the CPLM testing dataset, and its performance is listed in Table 10.The overall measurements were 0.9440, 0.9269, 0.9144, 0.8688 and 0.0431, respectively.Such performance was lower than that on the CPLM training dataset.However, the declined degree was not very large.The performance on the CPLM testing dataset was still relatively high, suggesting the strong generalization ability of RMTLysPTM.Given that the CPLM testing dataset was much larger than the independent testing set in Performance of RMTLysPTM on protein Q16778 section, such test results were more reliable.
The performance of RMTLysPTM on CPLM training and testing datasets was compared with those of the other models to show its superiority.The 5-fold cross-validation results of MLysPRED on the CPLM training dataset are provided in Table 9.Its performance was evidently lower than that of RMTLysPTM.The gaps on

Web server and user guide
For easy use of RMTLysPTM, a user-friendly web server with the same name was set up, which can be accessed at http://119.3.127.138/.The home page is shown in Figure 7. Users can submit a protein sequence and the web server can give the recognition result for each lysine site in this sequence.A step-by-step guide was provided below.
Step 1. Open the web server at http://119.3.127.138/ and input the protein sequence in the text box.Examples can be obtained by clicking the 'Example' button above the text box.
Step 2. Select the prediction model based on Qiu dataset or CPLM dataset below the text box.
Step 3.After inputting the protein sequence, click 'Submit' button to upload the sequence.The recognition result will be displayed in a new web page.Users can click 'Clean' to clear the current input and give a new input.
Step 4. In the result page, the locations of lysine sites in the sequence and the identified PTM types are listed.Using the 'Back' button, users can return the home page.
In this web server, the reference can be found by clicking 'Citation' button at the top of home page.The underlying dataset and codes are also provided.Users can click 'Download' button to obtain them.A notable detail that the feature representations of peptide segments were used to train the final classification model rather than those used in 5-fold cross-validation.

Limitations and future work
This model has some limitations.As shown in Figures 2 and 3, the lysine sites modified by four types were not equal.The sites modified by acetylation was evidently much more than those modified by the other three types.In another word, the Qiu and CPLM datasets were imbalanced.This problem was not considered when constructing RMTLysPTM.The addition of oversampling or undersampling techniques may improve the generalization ability of the model.Meanwhile, the features to represent peptide segment lacked diversity.When constructing RMTLysPTM, the distribution of 2-residue on the four types of lysine PTM sites was overemphasized, and the essential properties of the peptide segment may have been ignored.The model could be further improved by combining them.Finally, many lysine PTM types have been detected.For example, CPLM 4.0 collected nearly 30 types of lysine PTM sites.This study followed the previous ones and only investigated four types.If more lysine PTM types were employed, the labels could have been increased.Dealing with many labels is a challenging problem.In the future, this work will be continued to overcome the above limitations.

CONCLUSION
This study presented a multi-label classification model called RMTLysPTM to recognize the types of lysine PTM sites.Deep analysis was conducted to extract the distribution information of 2-residue with fixed location across four types to access the informative features of peptide segments containing lysine at the center.Powerful features were constructed on the basis of such information, and an efficient prediction engine that can capture the traits of the constructed features was designed.Such model is very powerful, and it can be an efficient tool to recognize types of lysine PTM sites.A user-friendly web server (http://119.3.127.138/) was set up for easy usage, along with the codes and underlying datasets.

Figure 2 .
Figure 2. Upset graph to illustrate four types of lysine PTM sites in Qiu dataset.Several lysine sites belong to two types, only nine sites belong to three types and no sites belong to all four types.1750 sites do not belong to any types.

Figure 3 .
Figure 3. Upset graph to illustrate four types of lysine PTM sites in CPLM training and testing datasets.(A) Upset graph for CPLM training dataset.(B) Upset graph for CPLM testing dataset.

Figure 4 .
Figure 4. Performance of the multi-label classification models under different values of θ in the prediction engine on Qiu dataset.When θ = 1/6, the model provides the best performance on all five overall measurements.

Figure 5 .
Figure 5. ROC and PR curves of RMTLysPTM for four types of lysine PTM sites in Qiu dataset.(A) ROC curves; (B) PR curves.The ROC and PR curves are nearly perfect and the AUROC and AUPR values are close to one, suggesting the extreme high performance of RMTLysPTM on four types of lysine PTM sites.

Figure 6 .
Figure 6.ROC and PR curves of RMTLysPTM for four types of lysine PTM sites in Qiu dataset under a strict test.(A) ROC curves; (B) PR curves.The AUROC and AUPR values are lower than those in Figure 5.The performance of RMTLysPTM on four types of lysine PTM sites reduced under a strict test.However, this performance is still relatively high.

Figure 7 .
Figure 7. Home page of the web server.

Table 1 :
Overall performance of RMTLysPTM on Qiu dataset

Table 2 :
Performance of RMTLysPTM on four types of lysine PTM sites in Qiu dataset

Table 3 :
Performance of RMTLysPTM under different sliding window sizes on Qiu dataset

Table 4 :
Performance of RMTLysPTM on four types of lysine PTM sites in Qiu dataset under a strict 5-fold cross-validation

Table 5 :
Performance of various models on Qiu dataset using popular methods and features